Project : Wrangle and Analyze Data

Data Gathering

In the cell below, gather all three pieces of data for this project and load them in the notebook. Note : the methods required to gather each dataset are different.

1) Directly download the WeRateDogs Twitter archive data (twitter_archive_enhanced.csv)

2) Use the Requests library to download the tweet image predictions (image_predictions.tsv)

3) Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)
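As a minimal sketch of loading the third file, assuming tweet_json.txt holds one JSON object per line (a common layout when each tweet's JSON is written out as it is fetched), the lines can be parsed into records as follows. The helper name parse_tweet_json and the sample lines are illustrative and not part of the project.

```python
import io
import json


def parse_tweet_json(lines):
    """Parse an iterable of JSON lines into a list of tweet dicts.

    Assumes one JSON object per line; blank lines are skipped.
    """
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        records.append(json.loads(line))
    return records


# Hypothetical sample standing in for the contents of tweet_json.txt.
sample = io.StringIO(
    '{"id": 1, "retweet_count": 5, "favorite_count": 10}\n'
    "\n"
    '{"id": 2, "retweet_count": 0, "favorite_count": 3}\n'
)
tweets = parse_tweet_json(sample)
print(len(tweets), tweets[0]["favorite_count"])
```

With the real file, the same helper could be fed an open file handle, and the resulting list passed to pd.DataFrame to build tweet_df.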

Assessing Data

In this section, detect and document at least eight (8) quality issues and two (2) tidiness issues. You must use both visual assessment and programmatic assessment to assess the data.

Note : pay attention to the following key points when you assess the data.

Quality issues

archive_df dataset :

1) Some columns contain missing values (NaN).

2) The ID columns (tweet_id, in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id and quoted_status_id) are stored as floats or integers rather than strings.

3) The values in the source column are wrapped in HTML link tags.

4) Some values in the name column are in lowercase.

5) The columns timestamp and retweeted_status_timestamp are not in "datetime" format.

predictions_df dataset :

6) The values in the tweet_id column are integers rather than strings.

7) Some values in the columns p1, p2 and p3 are in lowercase.

tweet_df dataset :

8) Some columns contain missing values (NaN).

9) The ID columns (id, in_reply_to_status_id and in_reply_to_user_id) are stored as floats or integers rather than strings.

10) The columns id_str, in_reply_to_status_id_str, in_reply_to_user_id_str and quoted_status_id_str are duplicates of the columns id, in_reply_to_status_id, in_reply_to_user_id and quoted_status_id.

11) The values in the display_text_range column are intervals.

12) The columns geo, coordinates and contributors are empty.

13) The values in the lang column are in lowercase.

14) The values in the source column are wrapped in HTML link tags.

15) The columns id and full_text should be renamed tweet_id and text respectively.

16) The columns entities, extended_entities and user should be removed.

Tidiness issues

tweet_df dataset :

1) The column created_at should be split into creation_date and creation_time columns.

2) The three tables should be merged into a single twitter_archive_master dataset.

Cleaning Data

In this section, clean all of the issues you documented while assessing.

Note : Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of tidy data. The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).

Quality issues :

1 Some columns in the archive_df dataset contain missing values (NaN).

Define :

We will replace all of them with the empty object ('None').

Code
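A minimal sketch of this step, assuming the "empty object" means Python's None rather than the text 'None'. The three-column frame is a hypothetical miniature of archive_df.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of archive_df; the real table has many more columns.
archive_df = pd.DataFrame(
    {
        "tweet_id": [1, 2],
        "in_reply_to_status_id": [np.nan, 10.0],
        "name": ["Phineas", np.nan],
    }
)

archive_clean = archive_df.copy()
# Cast to object first so that None is stored as-is instead of being
# coerced back to NaN in numeric columns.
archive_clean = archive_clean.astype(object).where(archive_clean.notna(), None)

print(archive_clean.loc[0, "in_reply_to_status_id"])
```

Note that pandas still treats None as missing, so isna() keeps flagging these cells after the replacement.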

Test

2 The ID columns (tweet_id, in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id and retweeted_status_user_id) in the archive_df dataset are stored as floats or integers rather than strings.

Define :

IDs are labels rather than quantities, so they are better stored as strings. We will convert these columns to the object type.

Code
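One possible way to do the conversion is sketched below. It swaps in pandas' nullable "string" dtype instead of plain object, an assumption made so that missing IDs stay missing rather than turning into the text "nan". The sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of archive_df.
archive_clean = pd.DataFrame(
    {
        "tweet_id": [1001, 1002],
        "in_reply_to_status_id": [2001.0, np.nan],
    }
)

id_columns = ["tweet_id", "in_reply_to_status_id"]
for col in id_columns:
    # Int64 (nullable integer) drops the ".0" that float IDs would carry,
    # and "string" keeps missing IDs as <NA> rather than the text "nan".
    archive_clean[col] = archive_clean[col].astype("Int64").astype("string")

print(archive_clean.dtypes)
```

Real tweet IDs are longer than a float can represent exactly, so an ID that was already loaded as a float may have lost digits. Reading the ID columns as strings at load time avoids that problem.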

Test

3 The values in the source column of the archive_df dataset are wrapped in HTML link tags.

Define :

Each value is a link to the download page of the client it names, so we will strip the link tags and keep only the text.

Code
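A short sketch of stripping the tags with a regular expression. The two sample values are assumptions about what the source column looks like, one anchor tag per tweet.

```python
import pandas as pd

# Hypothetical sample values for the source column.
archive_clean = pd.DataFrame(
    {
        "source": [
            '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
            '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
        ]
    }
)

# Remove anything that looks like an HTML tag, leaving only the link text.
archive_clean["source"] = (
    archive_clean["source"].str.replace(r"<[^>]+>", "", regex=True).str.strip()
)

print(archive_clean["source"].tolist())
```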

Test

4 Some values in the name column of the archive_df dataset are in lowercase.

Define :

Names start with a capital letter, so we will convert these strings to title case.

Code
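A minimal sketch using the pandas string accessor, on hypothetical sample names. Missing names are left untouched.

```python
import pandas as pd

# Hypothetical sample of the name column, including one missing value.
archive_clean = pd.DataFrame({"name": ["Phineas", "tilly", "archie", None]})

# str.title() capitalises the first letter of each word and skips missing values.
archive_clean["name"] = archive_clean["name"].str.title()

print(archive_clean["name"].tolist())
```

If some lowercase entries are not real names, this step capitalises them as well, so they may need a separate check.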

Test

5 The columns timestamp and retweeted_status_timestamp in the archive_df dataset are not in "datetime" format.

Define :

We will convert them to the 'datetime' type.

Code
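A sketch of the conversion with pd.to_datetime, assuming the timestamps are strings with a UTC offset as in the hypothetical sample below.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the string format is an assumption about the real data.
archive_clean = pd.DataFrame(
    {
        "timestamp": ["2017-08-01 16:23:56 +0000", "2017-07-31 00:18:03 +0000"],
        "retweeted_status_timestamp": [np.nan, "2017-07-19 00:47:34 +0000"],
    }
)

for col in ["timestamp", "retweeted_status_timestamp"]:
    # utc=True yields timezone-aware values; missing entries become NaT.
    archive_clean[col] = pd.to_datetime(archive_clean[col], utc=True)

print(archive_clean.dtypes)
```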

Test

6 The values in the tweet_id column of the predictions_df dataset are integers rather than strings.

Define :

Again, IDs are better stored as strings. We will convert the column to the object type.

Code

Test

7 Some values in the columns p1, p2 and p3 of the predictions_df dataset are in lowercase.

Define :

They are names, so we will convert these strings to title case.

Code

Test

8 Some columns in the tweet_df dataset contain missing values (NaN).

Define :

We will replace all of them with the empty object ('None').

Code

Test

9 The ID columns (id, in_reply_to_status_id, in_reply_to_user_id and quoted_status_id) in the tweet_df dataset are stored as floats or integers rather than strings.

Define :

We will convert these columns to the object type.

Code

Test

10 The columns id_str, in_reply_to_status_id_str, in_reply_to_user_id_str and quoted_status_id_str in the tweet_df dataset are duplicates of the columns id, in_reply_to_status_id, in_reply_to_user_id and quoted_status_id.

Define :

We must remove them and keep the originals.

Code

Test

11 The values in the display_text_range column of the tweet_df dataset are intervals.

Define :

Code

Test

12 The columns geo, coordinates and contributors in the tweet_df dataset are empty.

Define :

Since they don't have any data, we must delete those columns.

Code
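A small sketch of dropping the three columns, with a check that they really hold no data first. The frame is a hypothetical miniature of tweet_df.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of tweet_df.
tweet_clean = pd.DataFrame(
    {
        "id": [1, 2],
        "geo": [np.nan, np.nan],
        "coordinates": [None, None],
        "contributors": [np.nan, np.nan],
    }
)

empty_columns = ["geo", "coordinates", "contributors"]
# Confirm the columns really are empty before dropping them.
assert tweet_clean[empty_columns].isna().all().all()
tweet_clean = tweet_clean.drop(columns=empty_columns)

print(list(tweet_clean.columns))
```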

Test

13 The values in the lang column of the tweet_df dataset are in lowercase.

Define :

We will convert these strings to title case.

Code

Test

14 The values in the source column of the tweet_df dataset are wrapped in HTML link tags.

Define :

Again, each value is a link to the download page of the client it names, so we will strip the link tags and keep only the text.

Code

Test

15 The columns id and full_text in the tweet_df dataset should be renamed tweet_id and text respectively.

Define :

These are the standard column names across the Twitter datasets, so id and full_text should become tweet_id and text respectively.

Code
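The rename can be sketched with DataFrame.rename. The one-row frame is hypothetical.

```python
import pandas as pd

# Hypothetical miniature of tweet_df.
tweet_clean = pd.DataFrame({"id": ["1"], "full_text": ["Meet Phineas."]})

# Map old column names to the names used by the other tables.
tweet_clean = tweet_clean.rename(columns={"id": "tweet_id", "full_text": "text"})

print(list(tweet_clean.columns))
```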

Test

16 The columns entities and extended_entities in the tweet_df dataset should be removed.

Define :

Since they are very dirty, we must delete those columns.

Code

Test

Tidiness issues :

1 The column created_at in the tweet_df dataset should be split into creation_date and creation_time columns.

Define :

Code
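A possible sketch of the split, assuming created_at is still a string in the layout shown in the hypothetical sample (weekday, month, day, time, UTC offset, year). If the column was already parsed as a datetime when the file was loaded, the pd.to_datetime call can be skipped.

```python
import pandas as pd

# Hypothetical sample; the string layout is an assumption about the real data.
tweet_clean = pd.DataFrame({"created_at": ["Tue Aug 01 16:23:56 +0000 2017"]})

created = pd.to_datetime(
    tweet_clean["created_at"], format="%a %b %d %H:%M:%S %z %Y"
)
# .dt.date and .dt.time give plain Python date and time objects.
tweet_clean["creation_date"] = created.dt.date
tweet_clean["creation_time"] = created.dt.time
tweet_clean = tweet_clean.drop(columns="created_at")

print(tweet_clean)
```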

Test

2 The three datasets should be merged into one twitter_archive_master dataset.

Define :

According to the Storing Data section, we need one master dataset from the three datasets archive_clean, predictions_clean and tweet_clean.

Code
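A sketch of the merge on tweet_id. The inner join is an assumption: it keeps only tweets present in all three tables, whereas how="left" would keep every row of the archive. The columns other than tweet_id and p1 are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical miniatures of the three cleaned tables.
archive_clean = pd.DataFrame(
    {"tweet_id": ["1", "2", "3"], "rating_numerator": [13, 12, 11]}
)
predictions_clean = pd.DataFrame({"tweet_id": ["1", "2"], "p1": ["Chihuahua", "Pug"]})
tweet_clean = pd.DataFrame({"tweet_id": ["1", "3"], "favorite_count": [100, 50]})

# Chain two merges; tweet_id must have the same dtype in all three tables.
twitter_archive_master = archive_clean.merge(
    predictions_clean, on="tweet_id", how="inner"
).merge(tweet_clean, on="tweet_id", how="inner")

print(twitter_archive_master)
```

Columns that share a name across tables, other than the merge key, come out with _x and _y suffixes, so duplicated columns are best dropped or renamed before merging.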

Test 1 :

Test 2 :

Test 3 :

Storing Data

Save the gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".
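A minimal sketch of the save step, followed by a reload to check the round trip. The one-row frame and the temporary directory are illustrative only; in the notebook the file would simply be written next to it.

```python
import os
import tempfile

import pandas as pd

# Hypothetical one-row master table.
twitter_archive_master = pd.DataFrame(
    {"tweet_id": ["100000000000000001"], "text": ["Meet Phineas."]}
)

# A temporary directory is used here only to keep the sketch side-effect free.
out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, "twitter_archive_master.csv")
twitter_archive_master.to_csv(path, index=False)

# Without dtype=str, read_csv would turn the IDs back into integers.
reloaded = pd.read_csv(path, dtype={"tweet_id": str})
print(reloaded.dtypes)
```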

Analyzing and Visualizing Data

In this section, analyze and visualize your wrangled data. You must produce at least three (3) insights and one (1) visualization.

Insight 1 :

The number of times a tweet is favorited (or not), broken down by the number of images in the tweet and by the boolean flags retweeted and favorited.

Insight 2 :

The count of dog ratings by dog name and by the number of images in the tweet.

Insight 3 :

The mean text size by tweet source and user language.
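An aggregation of this kind can be sketched with groupby. The column name text_size and the sample values are hypothetical, since the actual column names of the master dataset are not shown here.

```python
import pandas as pd

# Hypothetical slice of the master dataset.
master = pd.DataFrame(
    {
        "source": ["Twitter for iPhone", "Twitter for iPhone", "Twitter Web Client"],
        "lang": ["En", "En", "En"],
        "text_size": [100, 120, 90],
    }
)

# Mean text size for each (source, language) pair.
mean_text_size = master.groupby(["source", "lang"])["text_size"].mean()
print(mean_text_size)
```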

Visualization

We chose to plot dog ratings against favorite counts, where the size of each point corresponds to the text size and its color to the number of images.